Covering Ambiguity Resolution in Chinese Word Segmentation Based on Contextual Information

نویسندگان

  • Xiao Luo
  • Maosong Sun
  • Benjamin Ka-Yin T'sou
چکیده

Covering ambiguity is one of the two basic types of ambiguities in Chinese word segmentation. We regard its resolution as equivalent to word sense disambiguation, and make use of the classical vector space model in information retrieval to formulate the contexts of ambiguous words. A variation form of TFIDF weighting is proposed and a Chinese thesaurus is additionally utilized to cope with data sparseness problem. We select 90 frequent cases of covering ambiguities as the target. The training set includes 77654 sentences, and the test set includes 19242 sentences. The experimental results showed that our model has achieved 96.58% accuracy, outperforming the original form of TFIDF weighting as well as another baseline model, the hidden Markov model.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Ambiguity Resolution in Chinese Word Segmentation

A new method for Chinese word segmentation named Conditional F&BMM (Forward and Backward Maximal Matching) which incorporates both bigram statistics (i.e., mutual information and difference of t-test between Chinese characters) and linguistic rules for ambiguity resolution is proposed in this paper. The key characteristics of this model are the use of: (i) statistics which can be automatically ...

متن کامل

Unknown Word Identification for Chinese Morphological Analysis ∗

Since written Chinese does not use blank spaces to indicate word boundaries, segmenting Chinese texts becomes an essential task for Chinese language processing. Besides word segmentation, we also need to identify the part-of-speech (POS) tags of the words. The segmentation and POS tagging process are denoted as morphological analysis. During the process of word segmentation, two main problems o...

متن کامل

An integrated approach for Chinese word segmentation

This paper presents an integrated approach for Chinese word segmentation, which can perform disambiguation and unknown word identification simultaneously on the input. In this work, a hybrid model is used to score known word candidates and unknown word candidates equally by incorporating the modified word-formation models (viz. word-juncture models and wordformation patterns) into word bigram m...

متن کامل

Chinese Word Segmentation Based on Contextual Entropy

Chinese is written without word delimiters so word segmentation is generally considered a key step in processing Chinese texts. This paper presents a new statistical approach to segment Chinese sequences into words based on contextual entropy on both sides of a bigram. It is used to capture the dependency with the left and right contexts in which a bigram occurs. Our approach tries to segment b...

متن کامل

Research on Tibetan Automatic Word Segmentation

This paper researches on Tibetan automatic word segmentation. We focus on three key technologies of Tibetan automatic word segmentation: (1) a Tibetan automatic word segmentation approach is proposed, which is taking the advantage of case-auxiliary words and continuous feature. (2) a resolution method of overlapping ambiguity in Tibetan word segmentation is proposed, which is based on forward-b...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002